availmon(5)                                                  availmon(5)

NAME
     availmon - overview of system availability monitoring facilities

DESCRIPTION
     The availability monitor (availmon) is a set of programs that are
     integrated with SGI Embedded Support Partner (a.k.a. ESP; see esp(5)
     for more details) to collectively monitor and report the
     availability of a system and the diagnosis of system crashes.

     For unexpected reboots, availmon identifies the cause of the reboot
     by gathering information from diagnostic programs such as
     icrash(1M), which includes results from the FRU analyzer when
     available, and syslog (see syslog(3C)), and system configuration
     information from configmon(1M), versions(1M), hinv(1M) and
     gfxinfo(1G).

     Availmon can send availability and diagnostic information to
     various locations, depending on configuration; it can provide local
     system availability statistics and reboot history reporting.  All
     availmon capabilities are configurable from the SGI ESP User
     Interface.

     By default, availmon does not automatically send reports on reboot.
     In all cases, the "Automatic e-mail distribution" flag must be
     enabled for availmon to send reports.

     Availmon reporting centers around events.  Any system reboot is an
     availmon event, whether a controlled shutdown or an "unscheduled"
     reboot, such as a power interruption or a "crash".  An event record
     contains the time at which the system was previously booted, which
     starts the event period, the time the event occurred, which ends
     the period of "uptime", the reason for the event, and the time that
     the system was rebooted.
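
     The four fields of an event record, and the "uptime" they imply,
     can be sketched as follows.  This is only an illustration: the
     class name AvailEvent, its field names, and the derived downtime
     value are assumptions made for the example, not the format availmon
     actually stores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AvailEvent:
    """Hypothetical in-memory form of an availmon event record."""
    prev_start: datetime  # time the system was previously booted
    event_time: datetime  # time the event (shutdown, panic, ...) occurred
    reason: str           # reason for the event (free text here, for illustration)
    restart: datetime     # time the system was rebooted after the event

    @property
    def uptime(self) -> timedelta:
        # The event period runs from the previous boot to the event itself.
        return self.event_time - self.prev_start

    @property
    def downtime(self) -> timedelta:
        # Illustrative derived value: gap between the event and the reboot.
        return self.restart - self.event_time
```

     For example, an event with prev_start at 2000-01-01 00:00 and
     event_time at 2000-01-03 12:00 has an uptime of two days and twelve
     hours.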
     If the system stopped as a result of a hang, the exact instant at
     which it stopped is not easily known; this time is obtained from
     the SGI ESP Event Monitor (see eventmond(1M) for more details) if
     the amconfig tickerd flag is configured.

     Events are grouped as either "Service Action" events or
     "Unscheduled" events.  Service Action events are controlled
     shutdowns, initiated by operators through shutdown(1M), halt(1M)
     and init(1M).  For such controlled shutdowns, a (configurable)
     prompt is given to identify the reason for the shutdown.
     Unscheduled events include system panics and system interrupts
     (power failures, power cycles, system resets, etc.).  Panics are
     identified as due to hardware, due to software, or due to unknown
     reasons.  This distinction is based strictly on results of the FRU
     analyzer, if present.

     Availmon generates three types of reports: availability, diagnosis
     and pager.

     Availability reports consist of the system serial number, full
     hostname/internet address, the previous system start time, the time
     of the event, the reason for the event (the event code), uptime,
     start time (following the reboot), and a summary of the reason for
     the event where relevant.

     Diagnosis reports include all data from an availability report, and
     additionally may contain the icrash analysis report, FRU analyzer
     result, important syslog messages, and system hardware/software
     configuration and version information.  Important syslog messages
     include error messages and all messages logged by sysctlrd and
     syslogd since the last reboot.  Duplicated messages are eliminated
     even if not consecutive; the first such message is retained with
     its time stamp, and the number of duplicated messages and the last
     time stamp are appended.
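
     The duplicate-elimination rule just described can be sketched as
     below.  This is a minimal illustration, not availmon's code: the
     function name collapse_duplicates, the (timestamp, message) input
     pairs, and the shape of the returned tuples are all assumptions
     made for the example.

```python
def collapse_duplicates(entries):
    """Collapse repeated messages, whether or not they are consecutive.

    `entries` is an iterable of (timestamp, message) pairs in log order.
    Returns a list of (first_ts, message, extra_count, last_ts) tuples:
    the first occurrence is kept with its time stamp, extra_count is the
    number of later duplicates, and last_ts is the time stamp of the
    final duplicate (None when the message occurred only once).
    """
    seen = {}  # message -> [first_ts, extra_count, last_ts]; insertion-ordered
    for ts, msg in entries:
        if msg in seen:
            seen[msg][1] += 1   # one more duplicate of an earlier message
            seen[msg][2] = ts   # remember the most recent time stamp
        else:
            seen[msg] = [ts, 0, None]
    return [(first, msg, count, last)
            for msg, (first, count, last) in seen.items()]
```

     Keying on the message text alone is what lets non-consecutive
     repeats collapse; a real implementation might normalize the text
     (for instance, stripping process IDs) before comparing.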
     System software version information is limited to version output
     for the operating system and installed patches.

     Pager reports are intended for "chatty pagers", and include only
     the system hostname, a brief description of the reason for the
     event, and the summary, if present.

     Availability information for the local system is always permanently
     stored in the SGI ESP database with the help of esplogger(1).
     Files in /var/adm/avail are maintained by availmon and should not
     be deleted, modified, or moved.

CONFIGURATION
     Once availmon is installed, "registration" is required before
     availmon reports are automatically distributed, and if desired,
     other options may also be configured.  Registration of a system can
     normally be accomplished simply by enabling the autoemail flag
     using "amconfig autoemail on".  There is no default email
     distribution.

     There are several other configuration options that can prove
     useful.  One is to configure sending availmon reports from one or
     more systems to a standard system administrator email alias.  This
     provides real-time notification of system activity.  Another
     similar option is to configure availmon pager reports for real-time
     notification to "chatty" pagers.  Alternatively, availmon
     diagnostic reports may be sent to a local support office, or to a
     system administrator for detailed evaluation.  To make these
     adjustments to email distribution, use "amconfig autoemail.list".

     Availmon can also generate periodic status reports that indicate
     that a system is still running and "registered" to send email
     reports.  This is controlled by the "Number of days between status
     updates" configuration value, which defaults to 7 days.
     Such reports are sent by eventmond, so they are sent only if the
     amconfig tickerd configuration flag is on.  NOTE: That option is
     now deprecated in favor of the eventmond command-line flag -n.

     Even where sending of availmon reports is not enabled, local system
     availability data is always maintained, and the
     Reports->Availability option can be chosen from the SGI ESP User
     Interface to produce statistical or event detail reports for the
     local system.

REPORT VIEWING
     The Reports->Availability option of the User Interface reviews
     saved availability report information and provides statistical and
     event history reports.  The ASCII interface provided by the
     "espreport availability" command can also be used.  By default, it
     processes the availability data on the local system.  It can also
     process aggregate site data; that is, an accumulation of availmon
     data from different systems.  Please refer to the SGI ESP User
     Guide for how to set up your system to collect availability data
     from different systems.

FILES
     /var/adm/avail/.save/lasttick
                         uptime in seconds since Jan 1, 1970 (written by
                         eventmond)

     /var/adm/crash/*    location of temporary availmon files:
                         availreport.*, diagreport.*, pagerreport.*

     /etc/init.d/availmon
                         init script that logs start/stop and initiates
                         notification

SEE ALSO
     espreport(1), esplogger(1), Mail(1), amconfig(1M), amreceive(1M),
     amsyslog(1M), amtime1970(1M), configmon(1M), eventmond(1M),
     halt(1M), hinv(1M), icrash(1M), init(1M), shutdown(1M),
     versions(1M), syslogd(1M), syslog(3C), esp(5).

REFERENCES
     SGI Embedded Support Partner User Guide.